[Project] Determining Literary Climaxes in Malazan

Chris Peralta

2019-11-22

Abstract

In this project, I attempt to find the climaxes of the series Malazan Book of the Fallen using network, text, and sentiment analysis. The series is notable as one of the longest and most complex fantasy series with a continuous single plot-line: it contains 3,252,031 words, and I estimated that there are at least 1,314 characters with approximately 457 unique points of view. Additionally, many of the characters have multiple aliases and nicknames, adding another layer of complexity; a character might go by completely different names in different novels.

I began by text mining the co-occurrence data from the books; I'll elaborate on that process later. From there, I had to get the co-occurrence data into a reasonable format and clean the name data. I then used the AFINN Lexicon to get the sentiment data. Finally, I compared the sentiment data with the network data to see which works better for determining the climax of each of the 10 books.

The AFINN Lexicon contains 2,476 words with negativity and positivity scores between -5 and 5.
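A minimal sketch of how AFINN scoring per chapter can work with the tidyverse. The two-word lexicon and `book_words` data frame here are toy stand-ins for illustration; the real pipeline would use `tidytext::get_sentiments("afinn")` and the full tokenized books.

```r
library(dplyr)

# Toy subset of the AFINN lexicon (the real one has 2,476 scored words);
# in the actual pipeline this comes from tidytext::get_sentiments("afinn")
afinn_toy <- tibble(word = c("abandon", "joy"), value = c(-2, 3))

# Toy stand-in for the tokenized books: one row per word
book_words <- tibble(
  book    = c(1, 1, 1, 1, 1),
  chapter = c(1, 1, 1, 2, 2),
  word    = c("abandon", "joy", "the", "abandon", "abandon")
)

chapter_sentiment <- book_words %>%
  inner_join(afinn_toy, by = "word") %>%  # unscored words like "the" drop out
  group_by(book, chapter) %>%
  summarise(sentiment_score = sum(value), .groups = "drop")
```

Summing the scored words per chapter yields the per-chapter sentiment scores used later in the results tables.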

Data Preparation

Numerous datasets from several sources were used in this project. I mined the co-occurrence data from the books myself, after converting them from .epub to .txt format. I created most of the alias data manually and crowd-sourced some of the aliases on Reddit. The name data was extracted from the Dramatis Personae sections at the start of each book and manually extracted from the Malazan Wiki.

I began by joining the character name data with the alias data into a single dataset. I then split all of the names on spaces to get partial variations and rejoined those back to the full names, giving a comprehensive dataset of full and partial names. From this list I filtered out stop words, formal titles, military ranks, and commonly capitalized words that aren't names. Finally, I arranged the list by string length so that longer names are matched first.
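The name-list construction above can be sketched as follows. The three names and the stop list are toy examples; the real list was built from the Dramatis Personae and wiki data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-ins for the real name and stop-word data
full_names <- tibble(name = c("Ganoes Paran", "Kalam Mekhar", "Sergeant Whiskeyjack"))
stop_list  <- c("Sergeant", "Captain", "The")

all_names <- full_names %>%
  separate_rows(name, sep = " ") %>%       # split into partial (one-word) names
  bind_rows(full_names) %>%                # keep the full names too
  distinct(name) %>%
  filter(!name %in% stop_list) %>%         # drop titles, ranks, stop words
  arrange(desc(str_length(name)))          # longest strings first
```

Sorting by descending string length means "Kalam Mekhar" is tried before "Kalam", which matters for the extraction step described below.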

Methodology

At this point, I went back to the book data and turned the text into ngrams of length 20, by book and chapter. I chose overlapping ngrams so that I would capture the full co-occurrence relations within each 20-word window; with a window of three, for example, "Ron Jon Bob went" yields "Ron Jon Bob" and then "Jon Bob went". Then I wrote a function that tries to extract every name in the name list from each ngram. If there is a match, it also removes the match from the string for subsequent iterations; otherwise "Brys Beddict" would be extracted 3 times, once as the full name and once for each partial name. Some of the code for this process is shown in the appendix.
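The ngram step can be reproduced with tidytext. This sketch uses a 3-word window on a toy line so the overlap is easy to see; the real pipeline used n = 20 on the full book text, and `to_lower = FALSE` is needed to preserve the capitalized names.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for one line of the pre-processed book data
lines <- tibble(book = 1, chapter = 1,
                line = "Ron Jon Bob went to the market")

ngrams <- lines %>%
  unnest_tokens(ngram, line, token = "ngrams", n = 3,
                to_lower = FALSE)  # keep capitalization for name matching
```

A 7-word line produces 5 overlapping 3-grams, each shifted by one word, so every within-window co-occurrence appears at least once.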

This method of extraction was computationally intensive, as there were 3,080 names in my name list and 3,250,530 ngrams. After several iterations of my code, I managed to get it to run in around 40 hours on my laptop, reducing the original speed of about 310 seconds per 1,600 ngrams to approximately 70 seconds per 1,600 ngrams.

Once the co-occurrence data was in a usable format, I moved on to data cleaning. Much of the cleaning was needed because the ngram approach produces partial name matches ("John Smith" also matched as "John") and because some characters have up to 9 different names. Using regular expressions, I formatted and removed variations of all names with over 100 appearances in the network data. I made three assumptions at this stage. Note: when I use the word "importance" below, I am generally talking about something with a high centrality measure.
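One way to collapse name variants onto canonical names is a regex lookup table with `stringr::str_replace_all()`. The three aliases below are real ones from the books, but the mapping table is a small illustrative sketch; the project's actual table covered every name with over 100 appearances.

```r
library(stringr)

# Hypothetical excerpt of an alias-to-canonical-name lookup table
alias_map <- c(
  "\\bAdjunct Tavore\\b" = "Tavore Paran",
  "\\bQuick Ben\\b"      = "Ben Adaephon Delat"
)

raw <- "Adjunct Tavore spoke with Quick Ben"
str_replace_all(raw, alias_map)
#> "Tavore Paran spoke with Ben Adaephon Delat"
```

Passing a named vector applies each pattern-replacement pair in turn, and the `\b` word boundaries stop partial words from being rewritten.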

Results

I'll jump right in by giving the top 10 most important characters by PageRank, compared with their degree centrality rankings. Only undirected network graphs are used in everything that follows.

Character             Degree centrality rank   PageRank rank
Tavore Paran                               1               1
Ben Adaephon Delat                         2               2
Hood                                       5               3
Ganoes Stabro Paran                        4               4
Fiddler                                    3               5
Kalam Mekhar                               6               6
Whiskeyjack                                8               7
Rhulad Sengar                             10               8
Gesler                                     7               9
Anomander Rake                            14              10

Degree centrality and PageRank mostly agree on the most important characters in the series, but they diverge greatly the further they get from the top 10, as shown in the figure to the right.
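Both centrality measures can be computed with igraph. This is a sketch on a tiny toy edge list; the real graph was built from the extracted co-occurrence pairs.

```r
library(dplyr)
library(igraph)

# Toy co-occurrence edge list (the real one has thousands of pairs)
edges <- tibble(from = c("Tavore", "Tavore", "Tavore", "Fiddler"),
                to   = c("Fiddler", "Hood", "Kalam", "Kalam"))

g <- graph_from_data_frame(edges, directed = FALSE)  # un-directed, as in the project

centrality <- tibble(character = V(g)$name,
                     degree    = degree(g),
                     pagerank  = page_rank(g)$vector) %>%
  arrange(desc(pagerank))
```

Ranking the same table by `degree` and by `pagerank` gives the two columns compared above.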

In the following network graph, I decided to use PageRank after experimenting with other centrality measures.

The only interesting observation I can make from this graph is that there is a distinction between the two main continents from the first 4 books in the series and the third main continent introduced in book 5. These groups become more connected in books 7, 8, 9, and 10.

Edge betweenness centrality looks at the shortest paths through the network that pass through each edge, and assigns each edge a value based on how much it "connects" the entire network. Removing an edge with high edge betweenness centrality will greatly impact the entire network.

Before trying to find the best way to indicate the climaxes in the series, I'll check whether a chapter's sentiment score is correlated with its importance. I initially thought that either could be used to predict climaxes on its own and that they might be correlated. The importance of a chapter was calculated by taking the mean of the edge betweenness centrality values of all of the edges in that chapter.
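The chapter-importance measure can be sketched with igraph's `edge_betweenness()`. The triangle graph below is a toy stand-in for one chapter's co-occurrence network.

```r
library(dplyr)
library(igraph)

# Toy edge list for a single chapter's co-occurrence graph
chapter_edges <- tibble(from = c("Tavore", "Tavore", "Fiddler"),
                        to   = c("Fiddler", "Hood", "Hood"))

g <- graph_from_data_frame(chapter_edges, directed = FALSE)

# Chapter importance: the mean edge betweenness over all edges in the chapter
mean_importance <- mean(edge_betweenness(g))
```

In this triangle every pair is connected directly, so each edge lies on exactly one shortest path and the mean importance is 1; real chapter graphs are far less uniform.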

Since nearly every chapter in the series has a negative sentiment score, negative values are used as an indicator of importance.

It appears that there is little to no correlation between a chapter’s mean sentiment and mean edge betweenness centrality.

Now, I'm going to see whether the mean edge betweenness centrality, weighted by the total sentiment score of each chapter, will find the major climax of each book. Note that the most common emotions were not used to calculate the sentiment scores or to find any of the climaxes.
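The weighting itself is a simple product of the two per-chapter quantities; this sketch uses the book 1, chapter 2 values from the table below alongside a made-up second row.

```r
library(dplyr)

# One real row (book 1, chapter 2) and one hypothetical row for contrast
chapters <- tibble(book = c(1, 1), chapter = c(2, 3),
                   mean_importance = c(47.09545, 12.5),
                   sentiment_score = c(-319, -100))

weighted <- chapters %>%
  mutate(weighted_importance = mean_importance * sentiment_score) %>%
  arrange(weighted_importance)  # most negative (most "climactic") first
```

Since sentiment scores are negative, larger importance and stronger negativity both push the weighted value further below zero.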

Book   Chapter   Mean importance   Sentiment score   Weighted importance   Most common emotion
   1         2          47.09545              -319             -15023.45   Fear
   2         7          47.20600              -382             -18032.69   Fear
   3        17          36.49187              -679             -24777.98   Fear
   4         2          48.44484              -779             -37738.53   Fear
   5        25          43.26630              -762             -32968.92   Fear
   6         7          22.69460             -1378             -31273.16   Fear
   7         9          69.81427              -597             -41679.12   Fear
   8        22         141.07185              -689             -97198.50   Fear
   9        15         139.05500              -898            -124871.39   Fear
  10        23          46.94171             -1246             -58489.38   Fear

While all of these chapters may be considered climaxes, I would say that only the chapters for books 5, 8, and 10 are main climaxes. One interpretation of the "weighted importance" value is that these are the chapters with the largest negative impact on the series as a whole, because many important characters are present and the chapters are very negative. If you have read the series, this sounds quite plausible, with the only exception being book 10, chapter 23.

Chapter summaries can be found here:
- Book 1: Gardens of the Moon
- Book 2: Deadhouse Gates
- Book 3: Memories of Ice
- Book 4: House of Chains
- Book 5: Midnight Tides
- Book 6: The Bonehunters
- Book 7: Reapers Gale
- Book 8: Toll the Hounds
- Book 9: Dust of Dreams
- Book 10: The Crippled God

Let's look at the mean edge betweenness centrality on its own.

Book   Chapter   Mean importance   Most common emotion
   1         6         163.52415   Trust
   2         1          90.71905   Fear
   3         8          58.17383   Trust
   4         5         115.19523   Trust
   5        22          77.69165   Fear
   6        13          56.99112   Fear
   7         7          76.33896   Fear
   8        22         141.07185   Fear
   9        13         204.90669   Fear
  10        12         130.86665   Fear

These are the chapters where the average edge has the highest edge betweenness centrality. The only main climax here is for book 8, though the chapters for books 5 and 8 could be considered sub-climaxes, so plain edge betweenness centrality doesn't quite work for finding the climaxes of each book. Book 2, chapter 1 simply has a lot of dialogue between important characters; I don't think it can be considered a climax of any degree.

What about only using sentiment scores?

Book   Chapter   Sentiment score   Most common emotion
   1         4              -371   Fear
   2         6              -495   Fear
   3        25              -860   Fear
   4         2              -779   Fear
   5        25              -762   Fear
   6         7             -1378   Fear
   7        24             -1034   Fear
   8        24              -710   Fear
   9        15              -898   Fear
  10        24             -1383   Fear

Well, this appears to be the best metric by far for finding climaxes. An easy conclusion to draw is that, in this series, the most negative part of each book tends to be its climax, and the most frequently occurring emotion in these chapters is "Fear". Some of these chapters have major battles or revolutions, and a couple are horribly morbid. Books 3, 5, 7, 8, and 10 all have what I believe are the main climaxes of their respective books, and the chapters for books 2, 4, 6, and 9 are all sub-climaxes.

Emotions in the NRC Lexicon:
- Fear
- Trust
- Sadness
- Anger
- Anticipation
- Joy
- Disgust
- Surprise
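The "most common emotion" column can be derived by joining tokens against the NRC lexicon and counting emotions per chapter. The two-word lexicon and `book_words` here are toy stand-ins; the real pipeline would use `tidytext::get_sentiments("nrc")`, filtered to the eight emotions above.

```r
library(dplyr)

# Toy stand-in for the NRC lexicon (word/sentiment pairs); note that one
# word can map to several emotions, so the join is many-to-many by design
nrc_toy <- tibble(
  word      = c("terror", "abandon", "abandon"),
  sentiment = c("fear", "fear", "sadness")
)

book_words <- tibble(book = 1, chapter = 1,
                     word = c("terror", "terror", "abandon"))

most_common_emotion <- book_words %>%
  inner_join(nrc_toy, by = "word") %>%
  count(book, chapter, sentiment) %>%        # emotion frequency per chapter
  group_by(book, chapter) %>%
  slice_max(order_by = n, n = 1) %>%         # keep the top emotion
  ungroup()
```

Here "fear" appears three times against one "sadness", so it wins the chapter, mirroring how "Fear" dominates the tables above.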

While simple positive and negative sentiment scores are the best metric for finding the climax of each book, all of these methods pick out some very interesting chapters.

Appendices

Co-occurrence extraction

In this section, I’ll include some snippets of the code that I used to extract the co-occurrence data from the books.

After being loaded into R and pre-processed, the books were in this format, in a single data.frame:

The book pre-processing code can be found here: https://github.com/visuelledata/malazannetwork/blob/master/R/import_books.R

#> # A tibble: 7 x 3
#>   line                                                        book chapter
#>   <chr>                                                      <dbl>   <int>
#> 1 PROLOGUE                                                       1       0
#> 2 1154th Year of Burn’s Sleep 96th Year of the Malazan Empi~     1       0
#> 3 I                                                              1       0
#> 4 THE STAINS OF RUST SEEMED TO MAP BLOOD SEAS ON THE BLACK,~     1       0
#> 5 The winds were contrary the day columns of smoke rose ove~     1       0
#> 6 Ganoes Stabro Paran of the House of Paran stood on tiptoe~     1       0
#> 7 For Ganoes, the ancient fortification overlooking the cit~     1       0

The extraction starts with data in the ngram format below.

#> # A tibble: 10 x 3
#> # Groups:   book, chapter [1]
#>     book chapter ngram                                                    
#>    <dbl>   <int> <chr>                                                    
#>  1     1       0 PROLOGUE 1154th Year of Burn’s Sleep 96th Year of the Ma~
#>  2     1       0 1154th Year of Burn’s Sleep 96th Year of the Malazan Emp~
#>  3     1       0 Year of Burn’s Sleep 96th Year of the Malazan Empire The~
#>  4     1       0 of Burn’s Sleep 96th Year of the Malazan Empire The Last~
#>  5     1       0 Burn’s Sleep 96th Year of the Malazan Empire The Last Ye~
#>  6     1       0 Sleep 96th Year of the Malazan Empire The Last Year of A~
#>  7     1       0 96th Year of the Malazan Empire The Last Year of Ammanas~
#>  8     1       0 Year of the Malazan Empire The Last Year of Ammanas’s Re~
#>  9     1       0 of the Malazan Empire The Last Year of Ammanas’s Reign I~
#> 10     1       0 the Malazan Empire The Last Year of Ammanas’s Reign I TH~

I then broke the ngram data into 10 separate lists, one per book, with each list containing a separate data frame for each chapter. I'll leave this output out as it would be too long.

Below is the main function I used to extract the co-occurrence data:

library(tidyverse)     # %>%, map_chr, str_*, unite, mutate, write_rds
library(future.apply)  # parallelized apply

process_data <- function(text, book_num, chap_num){
  tictoc::tic() # For execution time
  plan(multiprocess, workers = 4) # Sets parallelization parameters
  
  # Pulls out the co-occurrence data
  placeholder <- text %>%
    future_apply(1, function(x){ # Applies the function to each row of text
                      map_chr(all_names, # Plugs each name into the function
                        function(pat){ # Finds names
                          name <- str_extract(x, pattern = fixed(pat))
                          # Removes any match from the string so partial
                          # names aren't matched again in later iterations
                          if (!is.na(name)) x <<- str_remove(x, pattern = fixed(pat))
                          name # Returns the match (or NA)
                         }
                        )
                      },
                 future.seed = TRUE) # Outputs a matrix
  
  # Formats the data into a "tidy" format and does some basic cleaning
  placeholder %>% 
    t() %>% # Transposes the matrix
    as_tibble() %>% # Converts it to a data frame (as.tibble() is deprecated)
    unite("names", V1:V3080, sep = ";") %>% # Combines all columns into one
    remove_NAs() %>% # Custom helper: removes generated NAs, leaving only names
    mutate(book = book_num, # Adds book and chapter columns
           chapter = chap_num) %>%  
    write_rds(paste0("data/network_data/network_data", # Writes the data
                     book_num, "-", chap_num,   
                     ".rds"), 
              compress = "none")
  
  tictoc::toc() # Reports execution time
  return(NULL)
}

The process_data() function is then run in 10 separate for loops, one per book, each iterating over the chapters to write .rds files containing all of the co-occurrence data. One such loop is shown below.

for (i in seq_along(book1)){
  process_data(text = book1[[i]], book_num = 1, chap_num = i) 
}
